MBPP+

p-values

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their answers differ; ties are not used. The p-value is the probability, under the null hypothesis, of a difference at least as extreme as the one observed. Hover over each entry to display the information used to compute the p-value.
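
This is a sign test: whenever the two models disagree on a task, one of them wins, and a two-sided binomial test gives the p-value. A minimal sketch, assuming per-task pass/fail booleans for each model (the helper name and data layout are hypothetical):

```python
from scipy.stats import binomtest

def sign_test_p_value(passed_a, passed_b):
    """Two-sided p-value under the null that A and B each win
    a disagreement with probability 1/2; ties are discarded."""
    wins_a = sum(a and not b for a, b in zip(passed_a, passed_b))
    wins_b = sum(b and not a for a, b in zip(passed_a, passed_b))
    n = wins_a + wins_b  # ties (both pass or both fail) are not used
    if n == 0:
        return 1.0
    return binomtest(wins_a, n, p=0.5, alternative="two-sided").pvalue
```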

Typical delta needed for good p-values

We can also find the typical p-value for a typical difference in accuracy. Hover over a point to display the actual model pairs it represents.
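
A hypothetical sketch of how these points could be computed: for every pair of models, relate the gap in accuracy to the sign-test p-value, so the typical p-value at a given delta can be read off. `results` is an assumed mapping from model name to per-task pass/fail booleans:

```python
from itertools import combinations
from scipy.stats import binomtest

def delta_vs_pvalue(results):
    points = []
    for a, b in combinations(results, 2):
        ra, rb = results[a], results[b]
        delta = abs(sum(ra) - sum(rb)) / len(ra)          # accuracy gap
        wins_a = sum(x and not y for x, y in zip(ra, rb))
        wins_b = sum(y and not x for x, y in zip(ra, rb))
        n = wins_a + wins_b                               # ties dropped
        p = 1.0 if n == 0 else binomtest(wins_a, n).pvalue
        points.append((delta, p, (a, b)))
    return sorted(points)  # scan for the delta where p drops below 0.05
```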

Pairwise wins (including ties)

Following Chatbot Arena, these are the head-to-head comparisons between all pairs of models, reporting wins and two types of ties.
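
A sketch of the tally for one pair, again from per-task pass/fail booleans, assuming the two tie types are both-pass and both-fail:

```python
def head_to_head(passed_a, passed_b):
    wins_a = wins_b = tie_pass = tie_fail = 0
    for a, b in zip(passed_a, passed_b):
        if a and not b:
            wins_a += 1
        elif b and not a:
            wins_b += 1
        elif a:               # both passed
            tie_pass += 1
        else:                 # both failed
            tie_fail += 1
    return {"A wins": wins_a, "B wins": wins_b,
            "tie (both pass)": tie_pass, "tie (both fail)": tie_fail}
```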

Result table

We show three methods currently used for evaluating code models: raw accuracy (pass@1) as used by benchmarks, average win rate over all other models, and Elo (technically Bradley-Terry coefficients, following Chatbot Arena). These usually have near-perfect correlation.
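
A minimal sketch of such a fit, following Chatbot Arena's logistic-regression formulation of Bradley-Terry; the `battles` list of `(model_a, model_b, winner)` tuples is a hypothetical input derived from the head-to-head results (ties can be approximated by adding one win for each side):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bradley_terry(battles, models, scale=400, base=10, init=1000):
    idx = {m: i for i, m in enumerate(models)}
    X, y = [], []
    for a, b, winner in battles:
        row = np.zeros(len(models))
        row[idx[a]], row[idx[b]] = 1.0, -1.0   # +1 for A, -1 for B
        X.append(row)
        y.append(1 if winner == a else 0)
    # Near-unregularized fit; each coefficient is a model's BT strength.
    lr = LogisticRegression(fit_intercept=False, C=1e6).fit(np.array(X), y)
    # Rescale coefficients to the familiar Elo-like scale.
    return {m: init + scale * lr.coef_[0][idx[m]] / np.log(base)
            for m in models}
```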

 #  model                                   pass@1  win rate       Elo
 0  gpt-4-1106-preview                       0.733     0.814  1285.135
 1  meta-llama-3-70b-instruct                0.690     0.775  1244.763
 2  opencodeinterpreter-ds-33b               0.685     0.756  1225.049
 3  white-rabbit-neo-33b-v1                  0.669     0.722  1191.603
 4  opencodeinterpreter-ds-6.7b              0.664     0.721  1193.917
 5  deepseek-coder-6.7b-instruct             0.656     0.696  1167.416
 6  xwincoder-34b                            0.648     0.683  1156.001
 7  bigcode--starcoder2-15b-instruct-v0.1    0.651     0.679  1149.482
 8  HuggingFaceH4--starchat2-15b-v0.1        0.646     0.672  1145.369
 9  mixtral-8x22b-instruct-v0.1              0.643     0.664  1138.261
10  starcoder2-15b-oci                       0.632     0.652  1128.275
11  CohereForAI--c4ai-command-r-plus         0.635     0.649  1125.543
12  speechless-starcoder2-15b                0.624     0.635  1116.710
13  Qwen--Qwen1.5-72B-Chat                   0.616     0.618  1100.714
14  deepseek-coder-6.7b-base                 0.587     0.566  1056.786
15  codegemma-7b-it                          0.569     0.534  1030.377
16  speechless-starcoder2-7b                 0.563     0.512  1014.060
17  databricks--dbrx-instruct                0.558     0.505  1004.553
18  microsoft--Phi-3-mini-4k-instruct        0.542     0.479   982.766
19  codegemma-7b                             0.524     0.458   966.646
20  octocoder                                0.497     0.413   929.118
21  mixtral-8x7b-instruct                    0.497     0.411   926.871
22  codegemma-2b                             0.466     0.371   893.445
23  open-hermes-2.5-code-290k-13b            0.458     0.356   880.735
24  gemma-1.1-7b-it                          0.450     0.340   865.755
25  starcoder2-3b                            0.439     0.323   851.507
26  gemma-7b                                 0.434     0.319   848.045
27  codegen-6b                               0.429     0.309   837.629
28  mistral-7b                               0.421     0.292   820.887
29  codet5p-2b                               0.381     0.244   774.521
30  mistralai--Mistral-7B-Instruct-v0.2      0.370     0.236   766.272
31  codegen-2b                               0.360     0.211   741.190
32  gemma-7b-it                              0.328     0.193   718.705
33  gemma-2b                                 0.341     0.193   721.893